The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a very historic moment, both scientifically and analytically (we have a large amount of data being shared across the globe and analyzed in real time). The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by the source. I realized early on that the # of cases conflicted depending on the source. Originally I was using JHU data… but it was always 'ahead' of Our World In Data. I also noticed that JHU's website was buggy: you clicked on the U.S. stats but it didn't reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##------ Mon Jun 1 21:27:01 2020 ------##

PRE-ANALYSIS

The following sections are outside the scope of the 'analysis' proper but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Mon Jun  1 21:27:01 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name description
prediction_model outputs a case estimate for given log-linear model parameters (slope and intercept)
make_long converts input data to long format (specialized cases)
name_overlaps outputs the column-name intersection and set differences of two data frames
find_linear_index finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  # log-linear model: log10(cases) = m*days + b, so invert with 10^ to get case counts
  total_cases<-10^(m*days+b)
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
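Assuming the intent is to invert the log10-linear fit (predicted cases = 10^(m*days + b)), a minimal check with made-up parameters; the slope and intercept below are illustrative, not values fitted from this data:

```r
# Hypothetical sketch of the prediction step: m and b are invented
# log10-linear parameters, not fitted values from this analysis.
prediction_model_sketch <- function(m = 1, b = 0, days = 1) {
  total_cases <- 10^(m * days + b)  # invert log10(cases) = m*days + b
  data.frame(days = days, Total_confirmed_cases_perstate = total_cases)
}

# Doubling every 3 days gives m = log10(2)/3; b = 2 anchors day 0 at 100 cases.
pred <- prediction_model_sketch(m = log10(2) / 3, b = 2, days = 0:9)
pred$Total_confirmed_cases_perstate[c(1, 4)]  # 100 cases on day 0, 200 on day 3
```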

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/values. Also enforces standardization/assumes data structure naming by using fixed variable name, value name, and id.vars.
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------
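For intuition, the same wide-to-long pivot can be sketched in base R on a toy frame (make_long itself delegates to reshape2::melt; the values here are invented):

```r
# Toy wide-format frame: one row per region, one column per date (as JHU ships it).
toy_wide <- data.frame(Country.Region = c("A", "B"),
                       X1.22.20 = c(1, 0),
                       X1.23.20 = c(3, 2))
# Base-R equivalent of the melt: one row per (region, date) pair.
toy_long <- reshape(toy_wide, direction = "long",
                    varying = c("X1.22.20", "X1.23.20"),
                    v.names = "Total_confirmed_cases",
                    times = c("X1.22.20", "X1.23.20"),
                    timevar = "Date",
                    idvar = "Country.Region")
nrow(toy_long)  # 2 regions x 2 dates = 4 rows
```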

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
  i<-intersect(names(df1),names(df2))
  sd1<-setdiff(names(df1),names(df2))
  sd2<-setdiff(names(df2),names(df1))
  cat("intersection:\n",paste(i,"\n"))
  cat("in df1 but not df2:\n",paste(sd1,"\n"))
  cat("in df2 but not df1:\n",paste(sd2,"\n"))
  return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}
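name_overlaps is a thin wrapper around intersect()/setdiff() applied to names(); on two toy frames the underlying calls behave like this:

```r
# Two toy frames sharing columns b and c (invented names, for illustration only).
toy_df1 <- data.frame(a = 1, b = 2, c = 3)
toy_df2 <- data.frame(b = 0, c = 0, d = 0)
intersect(names(toy_df1), names(toy_df2))  # "b" "c"
setdiff(names(toy_df1), names(toy_df2))    # "a"  (in df1 but not df2)
setdiff(names(toy_df2), names(toy_df1))    # "d"  (in df2 but not df1)
```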

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  #Compute running count over the second derivative (window = running_avg): number of points with |second derivative| < 0.2
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-(running_avg-1)),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  #Find the date running_avg days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
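The idea behind the second-derivative test can be seen on a synthetic series (invented numbers): while growth accelerates, second differences of log2(cases) stay large; once the doubling rate settles, they drop to ~0 and the |·| < 0.2 criterion starts passing.

```r
# Synthetic log2(case) series: accelerating growth for 10 days, then a
# constant rate of 0.5 log2-units/day (all values invented).
log2_cases <- c(cumsum(seq(0.3, 3, by = 0.3)), 16.5 + 0.5 * (1:10))
d1 <- diff(log2_cases)   # first derivative: daily growth in log2 space
d2 <- diff(d1)           # second derivative: change in growth rate
which(abs(d2) < 0.2)     # stable (near-linear) region starts at index 10
```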

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X5.31.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X5.31.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X5.31.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X5.31.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US death file but not in the other datasets... Weird. When combining, all datasets will have this column, but US deaths is the only useful one.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # placeholder: only the US-level data has City
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU has a gross file format: it's wide format, with each column a date in MM/DD/YY. So read this in as raw data but transform it to be better suited for analysis.
# Furthermore, the World- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # leading 0 read in as X
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # ends in .20 and not 2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
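On a single header the steps above behave like this (note the %y format consumes only the first two year digits, so the trailing "20" left by the second gsub is ignored by as.Date):

```r
h <- "X1.22.20"                  # raw JHU column name for Jan 22, 2020
h <- gsub("^X", "0", h)          # "01.22.20"  (R prefixed the leading digit with X)
h <- gsub("20$", "2020", h)      # "01.22.2020"
as.Date(h, format = "%m.%d.%y")  # 2020-01-22
```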

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
Country.Region death total
Afghanistan 131 131
Albania 131 131
Algeria 131 131
Andorra 131 131
Angola 131 131
Antigua and Barbuda 131 131
Argentina 131 131
Armenia 131 131
Australia 1048 1048
Austria 131 131
Azerbaijan 131 131
Bahamas 131 131
Bahrain 131 131
Bangladesh 131 131
Barbados 131 131
Belarus 131 131
Belgium 131 131
Belize 131 131
Benin 131 131
Bhutan 131 131
Bolivia 131 131
Bosnia and Herzegovina 131 131
Botswana 131 131
Brazil 131 131
Brunei 131 131
Bulgaria 131 131
Burkina Faso 131 131
Burma 131 131
Burundi 131 131
Cabo Verde 131 131
Cambodia 131 131
Cameroon 131 131
Canada 1834 1834
Central African Republic 131 131
Chad 131 131
Chile 131 131
China 4323 4323
Colombia 131 131
Comoros 131 131
Congo (Brazzaville) 131 131
Congo (Kinshasa) 131 131
Costa Rica 131 131
Cote d’Ivoire 131 131
Croatia 131 131
Cuba 131 131
Cyprus 131 131
Czechia 131 131
Denmark 393 393
Diamond Princess 131 131
Djibouti 131 131
Dominica 131 131
Dominican Republic 131 131
Ecuador 131 131
Egypt 131 131
El Salvador 131 131
Equatorial Guinea 131 131
Eritrea 131 131
Estonia 131 131
Eswatini 131 131
Ethiopia 131 131
Fiji 131 131
Finland 131 131
France 1441 1441
Gabon 131 131
Gambia 131 131
Georgia 131 131
Germany 131 131
Ghana 131 131
Greece 131 131
Grenada 131 131
Guatemala 131 131
Guinea 131 131
Guinea-Bissau 131 131
Guyana 131 131
Haiti 131 131
Holy See 131 131
Honduras 131 131
Hungary 131 131
Iceland 131 131
India 131 131
Indonesia 131 131
Iran 131 131
Iraq 131 131
Ireland 131 131
Israel 131 131
Italy 131 131
Jamaica 131 131
Japan 131 131
Jordan 131 131
Kazakhstan 131 131
Kenya 131 131
Korea, South 131 131
Kosovo 131 131
Kuwait 131 131
Kyrgyzstan 131 131
Laos 131 131
Latvia 131 131
Lebanon 131 131
Lesotho 131 131
Liberia 131 131
Libya 131 131
Liechtenstein 131 131
Lithuania 131 131
Luxembourg 131 131
Madagascar 131 131
Malawi 131 131
Malaysia 131 131
Maldives 131 131
Mali 131 131
Malta 131 131
Mauritania 131 131
Mauritius 131 131
Mexico 131 131
Moldova 131 131
Monaco 131 131
Mongolia 131 131
Montenegro 131 131
Morocco 131 131
Mozambique 131 131
MS Zaandam 131 131
Namibia 131 131
Nepal 131 131
Netherlands 655 655
New Zealand 131 131
Nicaragua 131 131
Niger 131 131
Nigeria 131 131
North Macedonia 131 131
Norway 131 131
Oman 131 131
Pakistan 131 131
Panama 131 131
Papua New Guinea 131 131
Paraguay 131 131
Peru 131 131
Philippines 131 131
Poland 131 131
Portugal 131 131
Qatar 131 131
Romania 131 131
Russia 131 131
Rwanda 131 131
Saint Kitts and Nevis 131 131
Saint Lucia 131 131
Saint Vincent and the Grenadines 131 131
San Marino 131 131
Sao Tome and Principe 131 131
Saudi Arabia 131 131
Senegal 131 131
Serbia 131 131
Seychelles 131 131
Sierra Leone 131 131
Singapore 131 131
Slovakia 131 131
Slovenia 131 131
Somalia 131 131
South Africa 131 131
South Sudan 131 131
Spain 131 131
Sri Lanka 131 131
Sudan 131 131
Suriname 131 131
Sweden 131 131
Switzerland 131 131
Syria 131 131
Taiwan* 131 131
Tajikistan 131 131
Tanzania 131 131
Thailand 131 131
Timor-Leste 131 131
Togo 131 131
Trinidad and Tobago 131 131
Tunisia 131 131
Turkey 131 131
Uganda 131 131
Ukraine 131 131
United Arab Emirates 131 131
United Kingdom 1441 1441
Uruguay 131 131
US 131 131
US_state 427191 427191
Uzbekistan 131 131
Venezuela 131 131
Vietnam 131 131
West Bank and Gaza 131 131
Western Sahara 131 131
Yemen 131 131
Zambia 131 131
Zimbabwe 131 131
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some countries are not summarized at the country level. Collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

#Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case was recorded for each Country/Province. NOTE: non-US countries may have odd provinces. For example, France is summarized at the country level but also has 3 provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
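In miniature (toy counts), the alignment above works like this for a single region:

```r
# Toy series for one region: find the first day the count exceeds 100,
# then express every date relative to that day.
toy <- data.frame(Date.numeric = 1:5,
                  Total_confirmed_cases = c(20, 80, 150, 400, 900))
case100 <- min(toy$Date.numeric[toy$Total_confirmed_cases > 100])
toy$Days_since_100 <- toy$Date.numeric - case100
toy$Days_since_100  # -2 -1 0 1 2
```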

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 4601
Alaska 871
Arizona 1151
Arkansas 4830
California 4389
Colorado 4187
Connecticut 682
Delaware 283
Diamond Princess 76
District of Columbia 77
Florida 4906
Georgia 10976
Grand Princess 77
Guam 77
Hawaii 397
Idaho 2199
Illinois 6321
Indiana 6377
Iowa 5882
Kansas 4885
Kentucky 7067
Louisiana 4605
Maine 1169
Maryland 1793
Massachusetts 1155
Michigan 5579
Minnesota 5272
Mississippi 5710
Missouri 6349
Montana 1970
Nebraska 3683
Nevada 877
New Hampshire 807
New Jersey 1729
New Mexico 1940
New York 4381
North Carolina 6736
North Dakota 2302
Northern Mariana Islands 62
Ohio 6005
Oklahoma 4563
Oregon 2311
Pennsylvania 4735
Puerto Rico 77
Rhode Island 454
South Carolina 3329
South Dakota 2918
Tennessee 6495
Texas 13698
Utah 1058
Vermont 1080
Virgin Islands 77
Virginia 8502
Washington 2995
West Virginia 3119
Wisconsin 4562
Wyoming 1412
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question dataset x y color facet pch dimensions
comparative_longitudinal_case_trend long time log_cases geography none (case type?) case_type [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend long time cases geography case_type ? [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend wide time mortality rate geography none none [15, 50, 4] geography
death vs total correlation wide cases deaths geography none none [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in case/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in case/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? -> apply lm for each region starting from the 100th case. m, b associated with each.
    # input: geographical region, log cases vs day (since 100th case)
    # output: m, b for each geographical region ID



#total/death on same plot- differ by 2 logs, so when plotting log, use pch; when plotting absolute, need to use free scales
#when plotting death and case on the same plot, melt.

#CoronaCases -> filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 43829
China 131
Diamond Princess 112
Korea, South 102
Japan 101
Italy 99
Iran 96
Singapore 93
France 92
Germany 92
Spain 91
US 90
Switzerland 88
United Kingdom 88
Belgium 87
Netherlands 87
Norway 87
Sweden 87
Austria 85
Malaysia 84
Australia 83
Bahrain 83
Denmark 83
Canada 82
Qatar 82
Iceland 81
Brazil 80
Czechia 80
Finland 80
Greece 80
Iraq 80
Israel 80
Portugal 80
Slovenia 80
Egypt 79
Estonia 79
India 79
Ireland 79
Kuwait 79
Philippines 79
Poland 79
Romania 79
Saudi Arabia 79
Indonesia 78
Lebanon 78
San Marino 78
Thailand 78
Chile 77
Pakistan 77
Luxembourg 76
Peru 76
Russia 76
Ecuador 75
Mexico 75
Slovakia 75
South Africa 75
United Arab Emirates 75
Armenia 74
Colombia 74
Croatia 74
Panama 74
Serbia 74
Taiwan* 74
Turkey 74
Argentina 73
Bulgaria 73
Latvia 73
Uruguay 73
Algeria 72
Costa Rica 72
Dominican Republic 72
Hungary 72
Andorra 71
Bosnia and Herzegovina 71
Jordan 71
Lithuania 71
Morocco 71
New Zealand 71
North Macedonia 71
Vietnam 71
Albania 70
Cyprus 70
Malta 70
Moldova 70
Brunei 69
Burkina Faso 69
Sri Lanka 69
Tunisia 69
Ukraine 68
Azerbaijan 67
Ghana 67
Kazakhstan 67
Oman 67
Senegal 67
Venezuela 67
Afghanistan 66
Cote d’Ivoire 66
Cuba 65
Mauritius 65
Uzbekistan 65
Cambodia 64
Cameroon 64
Honduras 64
Nigeria 64
West Bank and Gaza 64
Belarus 63
Georgia 63
Bolivia 62
Kosovo 62
Kyrgyzstan 62
Montenegro 62
Congo (Kinshasa) 61
Kenya 60
Niger 59
Guinea 58
Rwanda 58
Trinidad and Tobago 58
Paraguay 57
Bangladesh 56
Djibouti 54
El Salvador 53
Guatemala 52
Madagascar 51
Mali 50
Congo (Brazzaville) 47
Jamaica 47
Gabon 45
Somalia 45
Tanzania 45
Ethiopia 44
Burma 43
Sudan 42
Liberia 41
Maldives 39
Equatorial Guinea 38
Cabo Verde 36
Sierra Leone 34
Guinea-Bissau 33
Togo 33
Zambia 32
Eswatini 31
Chad 30
Tajikistan 29
Haiti 27
Sao Tome and Principe 27
Benin 25
Nepal 25
Uganda 25
Central African Republic 24
South Sudan 24
Guyana 22
Mozambique 21
Yemen 17
Mongolia 16
Mauritania 13
Nicaragua 13
Malawi 7
Syria 7
Zimbabwe 5
Bahamas 4
Libya 4
Comoros 2
# Pick top 15 countries with data
max_colors<-12
# find a way to fix this- China has different provinces. Plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 131
Korea, South 102
Japan 101
Italy 99
Iran 96
Singapore 93
France 92
Germany 92
Spain 91
US 90
Switzerland 88
United Kingdom 88
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-02-01        18293
## 2    Afghanistan           <NA> <NA> 2020-03-10        18331
## 3    Afghanistan           <NA> <NA> 2020-03-09        18330
## 4    Afghanistan           <NA> <NA> 2020-05-08        18390
## 5    Afghanistan           <NA> <NA> 2020-05-27        18409
## 6    Afghanistan           <NA> <NA> 2020-03-20        18341
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      0                     0            NaN
## 2                      0                     5     0.00000000
## 3                      0                     4     0.00000000
## 4                    109                  3778     0.02885124
## 5                    227                 12456     0.01822415
## 6                      0                    24     0.00000000
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                      -Inf                       -Inf        18348
## 2                  0.698970                       -Inf        18348
## 3                  0.602060                       -Inf        18348
## 4                  3.577262                   2.037426        18348
## 5                  4.095379                   2.356026        18348
## 6                  1.380211                       -Inf        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1            -55  NA   NA         NA                           NA
## 2            -17  NA   NA         NA                           NA
## 3            -18  NA   NA         NA                           NA
## 4             42  NA   NA         NA                           NA
## 5             61  NA   NA         NA                           NA
## 6             -7  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)
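The do.call pattern above keeps one argument list and swaps only $data between the three melts; the same trick in miniature with a base function:

```r
# Reuse one argument list across calls, swapping a single element each time.
arg_list <- list(decreasing = TRUE)
arg_list$x <- c(3, 1, 2)
s1 <- do.call(sort, arg_list)   # 3 2 1
arg_list$x <- c(10, 30, 20)
s2 <- do.call(sort, arg_list)   # 30 20 10
```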


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)
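The commented-out fit above is a log10-linear regression; on synthetic data that doubles exactly every 3 days from 100 cases (invented values, not this dataset), it recovers slope log10(2)/3 and intercept log10(100) = 2:

```r
toy_days  <- 0:20
toy_cases <- 100 * 2^(toy_days / 3)     # exact 3-day doubling from 100 cases
fit <- lm(log10(toy_cases) ~ toy_days)  # same model form as the commented code
coef(fit)[["toy_days"]]                 # ~0.1003 = log10(2)/3
coef(fit)[["(Intercept)"]]              # ~2 = log10(100)
```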

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

  1. Find the time at which the cases vs. date curve becomes linear in each plot
  2. Fit a linear model for each state

# What is the predicted # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually adjust states: swap New York out for Maryland
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)
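`find_linear_index` is defined upstream of this chunk; as a rough sketch of the idea (my assumption of its logic, not the project's actual code), it scans for the first date at which the discrete second derivative of cases vs. time stays near zero:

```r
# Sketch (assumption): return the first numeric date at which a state's
# case curve looks linear, i.e. its discrete second derivative is ~0.
find_linear_index_sketch <- function(df, tol = 0.01) {
  df <- df[order(df$Date), ]
  y  <- df$Total_confirmed_cases_perstate
  d2 <- diff(y, differences = 2)          # discrete second derivative
  flat <- abs(d2) / max(y) < tol          # scale-free "near zero" test
  idx <- which(flat)[1]                   # first index where curvature vanishes
  as.numeric(df$Date[idx + 1])            # numeric date where the curve flattens
}
```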

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

# Initialize an empty frame so the rbind() below doesn't pick up an all-NA first row
predicted_days_df<-data.frame(Province.State=character(),days=numeric(),Total_confirmed_cases_perstate=numeric())
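`prediction_model` is defined elsewhere in this project; judging from how it is called in the loop below, a plausible sketch (hypothetical name and implementation, an assumption rather than the project's actual code) is a straight-line extrapolation returning the columns the `rbind()` expects:

```r
# Sketch (assumption): extrapolate the fitted line y = m*day + b to future
# numeric dates, returning the column names the rbind() below expects.
prediction_model_sketch <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}
```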

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the prediction of COVID-19 case counts based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long until x number of cases? When will the plateau happen? Are any effects of social distancing noticeable, given reporting delays?
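For the "how long until x cases" question: under the log10-linear fit used above (log10(cases) = m*t + b), the answer has a closed form. A small helper (hypothetical names, assuming that fit) illustrates:

```r
# Under log10(cases) = m*t + b, solve m*t + b = log10(target) for t
days_until_cases <- function(target_cases, m, b) {
  (log10(target_cases) - b) / m
}

# Doubling time in days implied by slope m on the log10 scale
doubling_time <- function(m) log10(2) / m
```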

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)

# Toy data with two known linear groups to validate the per-group fit below
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
lm(data =tmp,formula = y~x )

# Fit a linear model within each group and return the coefficients as a data frame
train_lm<-function(input_data,subset_column,formula_input){
case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
case_models.parameters <- ldply(case_models, coef)
# rename "(Intercept)" to "b" and the (single) slope term to "m"
names(case_models.parameters)[names(case_models.parameters)=="(Intercept)"]<-"b"
names(case_models.parameters)[3]<-"m"
return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:

  * Recreation
  * Workplace
  * Residence
  * Parks
  * Grocery

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases per 100 people (Today)"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases (top 15 of 300 rows shown; remainder omitted)

Province.State   variable            cor          median_change
Alaska           Transit             -1.0000000   -63.0
Delaware         Retail_Recreation    1.0000000   -39.5
Delaware         Grocery_Pharmacy     1.0000000   -17.5
Delaware         Transit              1.0000000   -37.0
Delaware         Parks               -1.0000000    20.5
Delaware         Workplace            1.0000000   -37.0
Delaware         Residential         -1.0000000    14.0
Hawaii           Retail_Recreation    0.9970242   -56.0
Hawaii           Grocery_Pharmacy     0.9784683   -34.0
New Hampshire    Parks                0.9567687   -20.0
Maine            Transit             -0.9114789   -50.0
Connecticut      Grocery_Pharmacy    -0.8984223    -6.0
Utah             Residential         -0.8798031    12.0
Vermont          Parks                0.8545779   -35.5
South Dakota     Parks                0.8521013   -26.0
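A caveat on this ranking: the perfect ±1 correlations at the top (e.g. Alaska, Delaware) are likely degenerate. If a state contributes only two mobility reports, Pearson correlation is forced to ±1 regardless of any real relationship:

```r
# Any two distinct points lie exactly on a line, so with n = 2 observations
# cor() is always +1 or -1, carrying no evidence of a real relationship.
x <- c(-63, -37)      # two mobility readings (% change from baseline)
y <- c(0.010, 0.024)  # two case-per-100 readings
cor(x, y)             # 1: two points always fit a line perfectly
```

States with such small n should probably be filtered out before ranking by |cor|.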
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to GitHub. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
| Name | Plot handle | Link |
|------|-------------|------|
| World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png |
| World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png |
| World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png |
| World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png |
| City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png |
| City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png |
| City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png |
| City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png |
| City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png |
| State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png |
| State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png |

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, are still in the log-linear phase). The mortality rate (deaths/confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of case load (U.S. data).
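The "log-linear phase" claim above can be checked region by region: fit a linear model to log-transformed case counts over a recent window and inspect the fit. A minimal sketch, using synthetic exponential-growth data rather than the real `Corona_Cases` tables:

```r
# Sketch: a region is still in the log-linear (exponential) phase when
# log10(cases) ~ day fits a straight line well. Toy data for illustration.
days  <- 1:14
cases <- 100 * 1.2^days            # synthetic 20%-per-day growth

fit <- lm(log10(cases) ~ days)
summary(fit)$r.squared             # near 1 => still log-linear
10^coef(fit)[["days"]]             # implied daily growth factor (~1.2 here)
```

An R-squared that drops well below 1 over the most recent window is the "trailing off" described for the U.S., NY, NJ, and PA.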

See table below for detailed breakdown.

| Question | Answer |
|----------|--------|
| What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California appears to have one of the best correlations, but this is a mixed bag. |
| What is the trend in cases and mortality across geographical regions? | Total confirmed cases and mortality are overall log-linear for most countries, with a trailing-off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region but is about 3% overall. |
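The mortality rate quoted above is the crude ratio of cumulative deaths to cumulative confirmed cases, so it moves as both series accumulate. A minimal sketch with toy numbers (not values from the real data):

```r
# Sketch: crude mortality rate = cumulative deaths / cumulative confirmed
# cases, the definition used throughout this analysis. Toy numbers only.
confirmed <- c(1000, 5000, 20000)
deaths    <- c(20, 150, 650)

mortality_rate <- deaths / confirmed
round(100 * mortality_rate, 2)     # percent at each time point: 2.00 3.00 3.25
```

Because testing volume inflates the denominator, this ratio understates or overstates true infection fatality depending on region, which is one reason the rate "fluctuates" across geographies.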

END

End: ##—— Mon Jun 1 21:28:27 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
# install.packages("maps")
library(maps)
library(dplyr)
library(ggplot2)
# Note: despite the mi_ prefix, these are Pennsylvania counties
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- seq_len(nrow(mi_counties))  # placeholder case counts, one per polygon vertex
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()
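The choropleth attempts above pass raw case coordinates (`Long`, `Lat`) to `geom_polygon()`, which needs the map's own polygon vertices and `group` ids to draw state outlines. A minimal working sketch of the intended plot, with toy case counts standing in for `Corona_Cases.US_state`:

```r
# Sketch: state-level choropleth. geom_polygon() must get the map's
# long/lat/group columns; case counts are joined on the region name.
library(ggplot2)
library(maps)
library(dplyr)

states <- map_data("state")                       # polygon vertices per state
toy_cases <- data.frame(                          # hypothetical counts
  region = c("pennsylvania", "new york", "maryland"),
  Total_confirmed_cases = c(60000, 250000, 30000)
)
choro <- left_join(states, toy_cases, by = "region")

ggplot(choro, aes(long, lat, group = group, fill = Total_confirmed_cases)) +
  geom_polygon(colour = "grey50") +
  scale_fill_viridis_c(na.value = "grey90") +     # grey for states without data
  coord_quickmap()
```

The real data would need its `Province.State` values lower-cased to match `map_data("state")` region names before the join.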


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

some.eu.countries <- c(
  "USA"  # the maps "world" database names this region "USA", not "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as label coordinate for country's names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

# dplyr joins work on sf objects; a shared key column would still be needed here
dplyr::inner_join(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp object to sf (st_cast applies to sf/sfc, not sp)

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net